WIP #7907

Closed · wants to merge 1 commit into from

Conversation

@tcsc commented Aug 13, 2021

This is intended as a soundness check on work to address issue #7606.

Goal

To have a node (SSH for now, but applicable to other services as well) notice when the proxy configuration changes and adapt automatically.

Context

Imagine you have a cluster with a node connected via a tunnel on port 3024:

 ┌──────┐            ┌───────┐            ┌──────┐
 │ Node ├────────────► Proxy ├────────────► Auth │
 └──────┘            └───────┘            └──────┘
  auth_servers:       public_addr:
  - example.com:3080    example.com:3080
                      tunnel_public_address:
                        example.com:3024

Now imagine you change the proxy config so that tunnel_public_address is example.com:4024. You either restart the proxy, or reload the proxy config with a SIGHUP.

...and the node doesn't reconnect to the proxy because, even though the auth_server address hasn't changed, the node has cached the old tunnel_public_address and keeps trying to connect to that.

You can always restart the node to have it reconnect, but that would be a pain if you have thousands of nodes.

This PR's approach

I initially attacked this by trying to leverage the discovery & re-connection machinery that already exists in the node (i.e. the AgentPool). Originally, the agent pool was given the tunnel_public_address that the node discovered on startup and always attempted to reconnect to that.

The WIP code in this PR removes that static config and replaces it with a callback that furnishes the AgentPool with a list of potential proxy addresses. The proof-of-concept handler polls webapi/find to get the current tunnel_public_address value and adds a new Agent to the AgentPool for that address if necessary. This is based on the assumption that once the new address is in the agent pool, re-connection will fall out of the existing machinery automatically.
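
For concreteness, the callback shape I have in mind is roughly the sketch below. This is not the actual diff: resolveProxyAddrs and the findResponse JSON field names are illustrative, and the real webapi/find response shape would have to come from the webclient package.

```go
// Sketch only: resolveProxyAddrs is the kind of callback the AgentPool
// could be handed. Given the (stable) web proxy address it returns the
// current set of candidate reverse-tunnel addresses.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// findResponse models only the part of the webapi/find payload we care
// about here. The field names are illustrative, not the real schema.
type findResponse struct {
	Proxy struct {
		TunnelPublicAddr string `json:"tunnel_public_addr"`
	} `json:"proxy"`
}

func resolveProxyAddrs(ctx context.Context, webProxyAddr string) ([]string, error) {
	url := fmt.Sprintf("https://%s/webapi/find", webProxyAddr)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var find findResponse
	if err := json.NewDecoder(resp.Body).Decode(&find); err != nil {
		return nil, err
	}
	return []string{find.Proxy.TunnelPublicAddr}, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := resolveProxyAddrs(ctx, "example.com:3080")
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	// The AgentPool would then add a new Agent for any address it is not
	// already dialling.
	fmt.Println("candidate tunnel addresses:", addrs)
}
```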

This works as expected up to a point,

...but

It turns out there is a bunch of stuff in the AgentPool that assumes the original tunnel_public_address is still available (see Agent.AccessPoint) in order to process the new connection. This makes perfect sense in the context of the proxy Discovery, in that it's designed to handle locating and adding extra proxies to the pool. It models the notion of moving a tunnel proxy to a new address somewhat less well.

This extends past the AgentPool as well. There are several places in the larger application that have a baked-in assumption that the Client wrapping the initial connection to the auth server is special, and that the corresponding tunnel parameters will remain valid (if not actually connected) for the lifetime of the process.

We can, of course, track down all of the state information that needs updating and do so, but it is beginning to sound like I am making this change at the wrong level.

Alternative approaches

Tunnel watchdog

We could detect the address change (e.g. by polling webapi/find occasionally and noting any changes in the reported addresses) and automatically restart the entire teleport instance when tunnel_public_address changes, rather than trying to manage the change internally.

This is starting to look like a sensible option, as so much of the teleport process assumes that the initial tunnel location will not change. I've started working on a proof-of-concept implementation that polls webapi/find, but this becomes problematic when a cluster has multiple, independently configured proxies that can be hit by any given poll. Some extra work would need to be done to prevent the SSH node from restarting on what is actually a legitimate configuration.
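
Roughly, the watchdog loop would look like the sketch below, assuming a full process restart is acceptable. pollFindAddr and restart are placeholders rather than existing functions, and the single last-seen address is exactly where the multi-proxy issue above bites.

```go
// Sketch only: a watchdog loop that polls for the advertised tunnel address
// and triggers a full restart when it changes. pollFindAddr and restart are
// placeholders, not existing Teleport functions.
package main

import (
	"context"
	"log"
	"time"
)

func watchTunnelAddr(ctx context.Context, interval time.Duration,
	pollFindAddr func(context.Context) (string, error), restart func()) {

	var lastAddr string
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			addr, err := pollFindAddr(ctx)
			if err != nil {
				log.Printf("find poll failed: %v", err)
				continue
			}
			if lastAddr == "" {
				lastAddr = addr
				continue
			}
			// With multiple independently configured proxies behind one
			// address, successive polls may legitimately disagree, so a real
			// implementation needs to track the full advertised set rather
			// than a single last-seen value.
			if addr != lastAddr {
				log.Printf("tunnel_public_address changed %q -> %q, restarting", lastAddr, addr)
				restart()
				return
			}
		}
	}
}

func main() {
	// Stub poller; a real one would hit webapi/find. This call blocks, so in
	// the node it would run in its own goroutine.
	poll := func(context.Context) (string, error) { return "example.com:4024", nil }
	watchTunnelAddr(context.Background(), 30*time.Second, poll,
		func() { log.Println("would restart the teleport process here") })
}
```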

Refactoring out

A larger-scale refactor of the AgentPool (and friends) to try to make it easier to handle this "proxy moved" situation. This would essentially mean that the initial client connection to the proxy becomes just another client in the pool, and can be rotated out of the AgentPool's proxy set like any other.
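
The shape I'm imagining is roughly the sketch below; the agent/pool types and the reconcile logic are hypothetical, not the current AgentPool API.

```go
// Sketch only: the agent and pool types here are hypothetical stand-ins,
// not the existing AgentPool API.
package main

import (
	"fmt"
	"sync"
)

// agent stands in for a reverse-tunnel agent dialling one proxy address.
type agent struct{ addr string }

func (a *agent) Close() { fmt.Println("closing agent for", a.addr) }

// pool keeps one agent per known proxy address. The connection made at
// startup is stored the same way as any agent discovered later, so it can
// be rotated out when its address disappears from the resolved set.
type pool struct {
	mu     sync.Mutex
	agents map[string]*agent
}

// setAddrs reconciles the pool against the latest resolved address set:
// start agents for new addresses, close agents whose address is gone.
func (p *pool) setAddrs(addrs []string) {
	p.mu.Lock()
	defer p.mu.Unlock()

	want := make(map[string]bool, len(addrs))
	for _, a := range addrs {
		want[a] = true
		if _, ok := p.agents[a]; !ok {
			p.agents[a] = &agent{addr: a} // a real agent would start dialling here
		}
	}
	for addr, ag := range p.agents {
		if !want[addr] {
			ag.Close()
			delete(p.agents, addr)
		}
	}
}

func main() {
	p := &pool{agents: map[string]*agent{}}
	p.setAddrs([]string{"example.com:3024"}) // the initial connection, nothing special about it
	p.setAddrs([]string{"example.com:4024"}) // tunnel_public_address moved; the old agent is rotated out
}
```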

....?

What do I want out of this discussion?

Basically, confirmation of whether or not the AgentPool is the correct place to try to implement this change, and, if it's not, some alternative approaches to the problem.

tcsc commented Aug 16, 2021

@awly, @fspmarshall suggested you might be able to lend an opinion here. Any suggestions gratefully accepted.

tcsc commented Aug 18, 2021

Going with the higher-level restart for now.

tcsc commented Sep 13, 2021

Closed in favour of #8102

@tcsc closed this Sep 13, 2021
@zmb3 deleted the tcsc/issue7606 branch April 26, 2023